Dynamic Modelling of Microarray Time Course Data
Abstract
The analysis of gene expression profiles obtained from DNA microarray experiments is used to discover relationships between genes and to discern groups of genes involved in common processes. The principal aim of this paper is to introduce dynamic modelling of microarray time course data. A novel approach to identifying similar gene expression profiles is presented. Using parametric modelling, we define a distance between expression profiles that identifies genes with similar dynamic responses, as opposed to distances among vectors. This approach provides an intuitive interpretation of similarity in the time domain and allows fast clustering using nearest neighbors.

1 DNA Microarray Data

For the data considered in this paper we assume that gene expression profiles were obtained from time course DNA microarray experiments. To illustrate the idea, and without loss of generality, we use the yeast data set described by Eisen et al. [ESBB98]. The data set is chosen to illustrate the proposed concept, and no reference is made to the biology or to the experiments which generated the data.

The data matrix $X$ has rows $i = 1, \ldots, n$ representing genes and columns

$$j = 1, \ldots, r_1,\; r_1 + 1, \ldots, r_1 + r_2,\; \ldots,\; \sum_{k=1}^{m-1} r_k + 1, \ldots, \sum_{k=1}^{m} r_k = r$$

of samples, grouped into $m$ experiments consisting of $r_1, r_2, \ldots, r_m$ measurements respectively. The data set discussed in [ESBB98] describes $n = 2467$ genes for $m = 9$ experiments. The number of measurements per experiment varies between 4 and 18.

A common approach is to combine the gene expression profiles of a number of experiments into a single row vector. In terms of distances between row vectors, this makes sense and can be useful for clustering directly on the matrix $X$. For example, with the Euclidean distance, the same importance is given to every data point in a row, independently of the experiment to which it belongs. Other, weighted distances could in principle be used to give different importance to each experiment.
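As a rough illustration of the last point, the sketch below computes a distance between two concatenated row vectors with one weight per experiment. The function name `block_weighted_distance`, its arguments, and the choice of a weighted Euclidean form are assumptions made for illustration; the paper does not specify a particular weighted distance. With all weights equal to one it reduces to the ordinary Euclidean distance.

```python
import math


def block_weighted_distance(x, y, block_sizes, weights=None):
    """Weighted Euclidean distance between two concatenated profiles.

    x, y        : sequences of equal length r (one row of X each)
    block_sizes : [r_1, ..., r_m], measurements per experiment, summing to r
    weights     : one non-negative weight per experiment (hypothetical
                  parameter); None means every experiment has weight 1
    """
    if len(x) != len(y) or sum(block_sizes) != len(x):
        raise ValueError("profile length must equal the sum of block sizes")
    if weights is None:
        weights = [1.0] * len(block_sizes)
    if len(weights) != len(block_sizes):
        raise ValueError("need exactly one weight per experiment")

    total = 0.0
    start = 0
    for size, weight in zip(block_sizes, weights):
        # Squared differences inside one experiment share the same weight.
        for a, b in zip(x[start:start + size], y[start:start + size]):
            total += weight * (a - b) ** 2
        start += size
    return math.sqrt(total)
```

Setting a weight to zero removes the corresponding experiment from the comparison altogether.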
Although it creates a larger sample, this approach assumes that a similar response of genes in unrelated experiments provides stronger evidence for common function. For time course data, grouping the time series of unrelated experiments does not make sense, since a model for the combined experiments will explain individual characteristics only poorly. For no reason other than to illustrate the basic ideas, we have worked with the time series corresponding to the first experiment, where $n = 2467$ and $r = r_1 = 18$. Each row vector $x_i = [x_{i1}, \ldots, x_{ij}, \ldots, x_{i,18}]$ represents a particular gene expression profile. Throughout the paper we refer to a gene by its (row) number in $X$.

∗ Author to whom correspondence should be addressed. The research was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) under grants GR/L95151 (‘Diagnostic Signal Analysis’) and GR/N21871 (‘Genetic Systems’).

Fig. 1: On the left, three gene expression profiles (1274, 1281 and 1745) with the same dynamic response pattern but shifted in time. On the right, two signals with the same pattern but with different levels of response (1613, 1384).

The element $x_{ij}$ of $x_i$ is the gene expression level of gene $i$ at time $j$. It is the normalized $\log_2$ ratio $E_{ij}/R_{ij}$, where $E_{ij}$ is the expression level or state of gene $i$ at time $j$, and $R_{ij}$ is the reference state of the gene, which is a constant value throughout the experiment [BGL00]:

$$x_{ij} = \frac{\log_2\left(\dfrac{E_{ij}}{R_{ij}}\right)}{\sqrt{\displaystyle\sum_{k=1}^{18} \left(\log_2\left(\dfrac{E_{ik}}{R_{ik}}\right)\right)^{2}}}, \qquad i = 1, \ldots, 2467, \quad j = 1, \ldots, 18 \tag{1}$$

With the normalization of (1), $x_{ij}$ is positive when $E_{ij} > R_{ij}$, and we say the gene is induced or “up-regulated”. When $E_{ij} < R_{ij}$, $x_{ij}$ is negative and the gene is said to be repressed or “down-regulated”.
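The normalization in (1) can be sketched in a few lines. This is a minimal illustration for a single gene, not the authors' code; the function name `normalize_profile` is hypothetical, and the reference state is passed as one constant because the text describes it as constant throughout the experiment.

```python
import math


def normalize_profile(expression, reference):
    """Return the unit-norm log2-ratio profile of one gene, as in (1).

    expression : positive expression levels E_i1, ..., E_ir for gene i
    reference  : positive reference state R of the gene (assumed constant
                 over the time course, as stated in the text)
    """
    log_ratios = [math.log2(e / reference) for e in expression]
    # Denominator of (1): Euclidean norm of the log2 ratios over all times.
    norm = math.sqrt(sum(v * v for v in log_ratios))
    if norm == 0.0:
        raise ValueError("flat profile: every E equals R, norm is zero")
    return [v / norm for v in log_ratios]
```

Each returned value is positive where the expression level exceeds the reference, negative where it falls below it, and the squared values of a profile sum to one.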
Here we assume that there are no missing values; that is, measurements have been obtained for all points in the time course, or missing values have been replaced by a suitable method.

2 Signal Selection and Clustering

Virtually all clustering studies published to date using the time course data presented in [ESBB98] have ignored the dynamic component of the microarray experiments. It is the principal aim of this paper to suggest a dynamic systems approach to microarray time course data. Using a distance or the correlation between groups of signals, it is possible to identify genes with similar shapes of expression profiles. This approach will, however, discount signals which are similar except that one is delayed with respect to the other, or one has a stronger or weaker response but the same dynamic behavior. Figure 1 illustrates this point.

The method is summarized by the following steps. Each signal is modelled as an individual time series using a parametric technique such as ARIMA models. Signals with the same model structure, such as AR(2) or MA(1), can then be grouped. Inside each group, similar signals have similar parameters, and a clustering algorithm can be applied to identify patterns among the parameters in each group. In this paper we do not go as far as clustering by model structure, but illustrate the idea by fitting a simple autoregressive model to the data and then grouping genes as nearest neighbors in the parameter space of the models.

In microarray experiments considering a large number of genes, as in whole genome arrays, we often find that many genes do not show any response, leading to noisy signals with no deterministic component. In [NW01], we suggested a number of statistical tests which may be used to select ‘informative’ signals and thereby reduce the computational cost of the analysis. This selection, however, remains subjective, and with relatively small sample sizes such tests are inevitably unreliable.
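The modelling step described in this section (fit an autoregressive model to each profile, then group genes as nearest neighbors in parameter space) might be sketched as follows. Everything specific here is an assumption for illustration: the model order (AR(2) without intercept), the ordinary least-squares fit, the Euclidean distance between parameter vectors, and the helper names `simulate_ar2`, `fit_ar2` and `nearest_neighbor`. The text above does not state the order or the estimation method.

```python
import math


def simulate_ar2(a1, a2, x0, x1, length):
    """Noise-free AR(2) series x[t] = a1*x[t-1] + a2*x[t-2] (toy data only)."""
    series = [x0, x1]
    while len(series) < length:
        series.append(a1 * series[-1] + a2 * series[-2])
    return series


def fit_ar2(x):
    """Least-squares estimate (a1, a2) for x[t] = a1*x[t-1] + a2*x[t-2] + e[t]."""
    if len(x) < 4:
        raise ValueError("need at least 4 samples to fit two parameters")
    s11 = s12 = s22 = b1 = b2 = 0.0
    for t in range(2, len(x)):
        u, v, y = x[t - 1], x[t - 2], x[t]
        s11 += u * u
        s12 += u * v
        s22 += v * v
        b1 += u * y
        b2 += v * y
    # Solve the 2x2 normal equations by Cramer's rule.
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("lagged values are collinear; AR(2) fit is ill-posed")
    a1 = (b1 * s22 - b2 * s12) / det
    a2 = (s11 * b2 - s12 * b1) / det
    return (a1, a2)


def nearest_neighbor(params, query):
    """Index and distance of the parameter vector closest to params[query]."""
    best_index, best_distance = None, math.inf
    for index, candidate in enumerate(params):
        if index == query:
            continue
        distance = math.dist(candidate, params[query])
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance
```

In this sketch a profile and a rescaled copy of it receive the same parameters, because the model has no intercept, and a later segment of the same noise-free series does as well. That mirrors the motivation given above: signals that differ only in delay or strength of response should end up close together in parameter space even when they are far apart as vectors.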
To achieve the objective of the present paper it is not necessary to select signals, but to illustrate and visualize the idea we wish to ‘clear’ the rather dense parameter space. For visualization purposes, we therefore discard signals which have neither a trend nor any dependency among the elements of the sample. For this we use the ‘Runs Test’ [NW01], which assumes that such a signal oscillates above and below the median with equal probability. The second criterion that we apply is to
Similar resources
Inferring gene networks from time series microarray data using dynamic Bayesian networks
Dynamic Bayesian networks (DBNs) are considered as a promising model for inferring gene networks from time series microarray data. DBNs have overtaken Bayesian networks (BNs) as DBNs can construct cyclic regulations using time delay information. In this paper, a general framework for DBN modelling is outlined. Both discrete and continuous DBN models are constructed systematically and criteria f...
Novel technique for preprocessing high dimensional time-course data from DNA microarray: mathematical model-based clustering
MOTIVATION Classifying genes into clusters depending on their expression profiles is one of the most important analysis techniques for microarray data. Because temporal gene expression profiles are indicative of the dynamic functional properties of genes, the application of clustering analysis to time-course data allows the more precise division of genes into functional classes. Conventional cl...
Literature Review of Traffic Assignment: Static and Dynamic
Rapid urban growth is resulting into increase in travel demand and private vehicle ownership in urban areas. In the present scenario the existing infrastructure has failed to match the demand that leads to traffic congestion, vehicular pollution and accidents. With traffic congestion augmentation on the road, delay of commuters has increased and reliability of road network has decreased. Four s...
Dynamic modelling of hardness changes of aluminium nanostructure during mechanical ball milling process
In this research, the feasibility of using mathematical modelling in the ball milling process has been evaluated to verify the hardness changes of an aluminium nanostructure. Considering the model of normal force displacement (NFD), the radius of elastic-plastic and normal displacement of two balls were computed by applying analytical modelling and coding in MATLAB. Properties of balls and alum...
Modeling Gene Expression from Microarray Expression Data with State-Space Equations
We describe a new method to model gene expression from time-course gene expression data. The modelling is in terms of state-space descriptions of linear systems. A cell can be considered to be a system where the behaviours (responses) of the cell depend completely on the current internal state plus any external inputs. The gene expression levels in the cell provide information about the behavio...
Publication date: 2001